SUMMARY


Ratings are often the sole source of information about job
performance. However, they are not objective measures; ratings
can be invalid or contain inaccuracies. Research designs must be
used to isolate the factors that distort the ratings and,
subsequently, to improve the quality of the ratings. The
multitrait-multimethod and person perception designs have been
used to isolate such factors. The goal of the present research
was to develop a design that combined both the
multitrait-multimethod and person perception designs. Examples were
presented to illustrate the combination design, and it was used
to isolate the influence of rater, ratee, and context factors on
the validity and accuracy of performance ratings. It was
recommended that the combination design be used in future
research to improve performance ratings.
PREFACE


This research was conducted under the USAF-SCEEE Summer
Faculty Program sponsored by the Air Force Office of Scientific
Research. The work was accomplished by the author while working
in the Manpower and Personnel Division, Air Force Human Resources
Laboratory. It complements the efforts of the Productivity and
Performance Measurement Function, which is working on a long-term
job performance criterion development project.
PERFORMANCE RATINGS: DESIGNS FOR EVALUATING
THEIR  VALIDITY  AND  ACCURACY 

I.  INTRODUCTION 

Performance ratings are an important method for measuring
and defining human attributes. They have been used in research
and applied contexts to describe a diversity of human attributes
such as group leadership skills, problem-solving ability, and
interpersonal skills. In some contexts, performance ratings
serve as substitutes for more objective but expensive methods
such as work sample testing, whereas in other contexts, ratings
are the only practical measures of attributes.

Despite the utility of performance ratings, they must be
interpreted with caution. Since they require human judgments,
performance ratings are fallible measures. Several distortions
in ratings have been identified that illustrate their
fallibility, including leniency, halo, and similarity errors
(Landy & Farr, 1980). Such distortions limit inferences about
human attributes and the amounts of those attributes possessed by
the individuals who are rated.

The limitations to inferences have been addressed with
research into the validity and accuracy of ratings (DeCotiis &
Petit, 1978; Saal, Downey, & Lahey, 1980). The validity of
ratings has been investigated with a multitrait-multimethod
design (Boruch, Larkin, Wolins, & MacKinney, 1970). The purpose
in using such a design was to evaluate performance ratings
against criteria that are logical requirements for measures of
human attributes. In particular, variance components and
intraclass correlations were computed to evaluate the individual
differences in performance accounted for by the ratings. The
accuracy of ratings has been investigated with a person
perception design (Cronbach, 1955). The purpose in using such a
design was to compare performance ratings against target ratings
that have been specified by the investigator for the research
context. In this design, accuracy statistics were computed to
describe several discrepancies between the performance and target
ratings.

Research on the validity of ratings was stimulated by
Lawler's (1967) application of a multitrait-multimethod design.
He emphasized that several sources are available for obtaining
ratings (e.g., supervisors, peers, and the self) and that these
sources may differ in their ratings of performance. Lawler
encouraged the application of a multitrait-multimethod approach
to compare ratings from several sources. Subsequent research has
used a multitrait-multimethod design to investigate formats for
obtaining ratings (Burnaska & Hollmann, 1974), the nature of
human attributes (Borman & Dunnette, 1975), and rater training
(Borman, 1978).

Research on the accuracy of performance ratings has focused
on the effects of rater training (Bernardin & Pence, 1980;
Borman, 1977, 1979a; Hedge, 1982; McIntyre, Smith, & Hassett,
1984). Borman (1977, 1979a) introduced the person perception
design to assess training to avoid leniency and halo errors. He
found that an admonishment to avoid these distortions was
successful, but accuracy was not improved. Apparently, raters
learned to avoid certain distortions but not how to rate
accurately. Other studies have addressed the relationship
between the accuracy of ratings and rater attributes such as
personality, interests, and observational skills (Borman, 1979b;
Murphy, Garcia, Kerkar, Martin, & Balzer, 1982).

Although several factors have been investigated as
determinants of the validity and accuracy of ratings (cf.
DeCotiis & Petit, 1978; Landy & Farr, 1980), no comparison has
been made of their influence on both validity and accuracy. The
research has compared ratings either against criteria for
describing individual differences in performance or against
target ratings that specify appropriate performance. These
factors should be included in a research design that assesses
their joint influence on the validity and on the accuracy of
ratings. Such a design would employ a multifactor approach to
investigate the limits that the factors place on inferences about
human attributes.

II.  OBJECTIVES  OF  THE  RESEARCH  EFFORT 

The goal of this research effort was to develop a design to
guide investigations of both the validity and accuracy of
ratings. The derived design combined the multitrait-multimethod
and person perception designs and utilized the procedures of
analysis of variance. Prior to the presentation of the
combination design, the multitrait-multimethod and person
perception designs will be described to provide a background.
Examples will be discussed that illustrate all the designs.

III.  VALIDITY  OF  PERFORMANCE  RATINGS 

Performance ratings measure attributes that are assumed to
account for performance differences among individuals. Although
the attributes are identified and operationally defined through
established scientific methodologies such as job analysis and
criterion development (McCormick, 1976; Smith, 1976), the
assumption should be questioned in most rating contexts. Job
analysis and criterion development may produce attributes that
are poorly defined, irrelevant, or redundant with other
attributes, and the performance ratings for such attributes will
reflect no meaningful differences in individual performance.

Multitrait-multimethod validation is a research strategy for
assessing the individual differences accounted for by performance
ratings (Kavanagh, MacKinney, & Wolins, 1971). In this strategy,
a rating measure is defined as a trait-method unit. A trait is
conceived as a human attribute that is conceptually and
statistically distinct from other attributes accounting for
performance. Some examples of attributes include ability to
facilitate group discussions, define acceptable work procedures,
and provide negative feedback to others. A method is a procedure
for operationally defining traits. Some methods include
forced-choice scales, checklist scales, and example-anchored
scales. In sum, a rating measure taps a particular trait with a
particular methodology.

The trait-method combinations in a research study are
determined by the rating context. This context is dictated by
the interests of the researcher and the nature of performance.
For example, a researcher may use job analysis to define the
traits that significantly affect the performance of jet engine
mechanics for a commercial airline company, decide to measure
that performance with two formats for ratings, and obtain ratings
of the mechanics by their immediate supervisor. Thus, the
researcher "designs" the multitrait-multimethod investigation.


Basic Design


Analysis of variance has been used to analyze the ratings
from a multitrait-multimethod investigation (Boruch et al.,
1970). The basic design includes the three factors of ratees,
traits, and methods. As shown in Table 1, the variation in
ratings is partitioned into seven sources. The researcher is not
concerned with all of the sources of variation in the analyses.
The fixed effects of Methods, Traits, and Methods x Traits are
usually based on scales of convenience to the investigator and
provide no information about validity. For example, two methods
may differ because one method employs 5-point scales while the
other employs 9-point scales, and two traits may differ because
one is more socially desirable. In contrast, the random effects
of Ratees, Ratees x Traits, Ratees x Methods, and Error provide
information about the validity of the measures. These sources
allow inferences about the individual differences among ratees.


The  Ratees  source  of  variation  indicates  the  ability  of  the 
measures  to  order  the  ratees.  This  ordering  can  be  due  to  either 
traits  or  methods,  or  both.  Of  course,  the  more  the  measures 
agree  in  their  ordering  of  ratees,  the  more  the  measures  describe 
individual  differences  among  the  ratees.  The  Ratees  source  of 
variation  is  said  to  reflect  the  convergent  validity  of  the 
measures. 

The  Ratees  x  Traits  interaction  indicates  differential 
ordering  of  the  ratees  by  the  traits.  Since  the  traits  should 
reflect  different  aspects  of  performance,  the  interaction  is 
desirable.  In  fact,  the  stronger  the  interaction,  the  greater 
the  number  of  distinct  discriminations  between  the  ratees  with 
the  traits.  The  Ratees  x  Traits  source  of  variation  reflects  the 
discriminant  validity  of  the  measures. 

The  Ratees  x  Methods  interaction  indicates  differential 
ordering  of  the  ratees  with  the  methods.  This  differential 
ordering  is  undesirable.  The  methods  for  rating  should  not 
influence  the  ordering  of  ratees.  Only  the  traits  should 
determine  the  ordering  of  ratees.  The  Ratees  x  Methods  source  of 
variation  reflects  the  method  bias  of  the  measures. 

The  Error  source  indicates  residual  variation  due  to 
sampling  and  measurement  errors.  The  size  of  this  effect 
relative  to  the  remaining  sources  of  variation  suggests  the 
extent  of  differences  between  the  ratees  that  cannot  be  accounted 
for  by  the  traits  and  the  methods. 

The Error mean square may be used to compute F-ratios to
establish statistical significance for the remaining sources.
However, the F-ratios are based on mean squares with large
degrees of freedom, and the critical F-values to establish
significance are frequently exceeded when the differences are not
practically significant. A more appropriate strategy for
assessing the relative variation in ratings explained by the
sources is to compare variance components. These components
provide a comparison of the relative sizes of convergent
validity, discriminant validity, method bias, and error, while
controlling for degrees of freedom. For a single research study,
the variance components may be compared directly. However, since
the variance component due to Error would differ from study to
study, a direct comparison of the variance components across
several studies is not appropriate. Rather, ratios of the variance
components are formed to generate intraclass correlation
coefficients. These ratios are expressed as a source's component
divided by the sum of all variance components. Each ratio
reflects the proportion of variance accounted for by that source
relative to the variation accounted for by all sources.
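
The variance components and intraclass correlations described above
can be sketched in a few lines of Python. This is a minimal
illustration, not the computation reported in the tables: it assumes
one rating per ratee-trait-method cell, treats Ratees as random and
Traits and Methods as fixed, and uses the expected mean squares that
follow from those assumptions. The function name
mtmm_variance_components and the truncation of negative estimates to
zero are choices made here for illustration.

```python
import numpy as np

def mtmm_variance_components(x):
    """Variance components for a ratees x traits x methods array.

    Assumes one rating per cell, Ratees random, Traits and Methods fixed.
    Returns (components, intraclass_correlations) as dictionaries.
    """
    x = np.asarray(x, dtype=float)
    n, t, m = x.shape
    grand = x.mean()
    r = x.mean(axis=(1, 2))      # ratee means
    tr = x.mean(axis=(0, 2))     # trait means
    me = x.mean(axis=(0, 1))     # method means
    rt = x.mean(axis=2)          # ratee x trait means
    rm = x.mean(axis=1)          # ratee x method means
    tm = x.mean(axis=0)          # trait x method means

    # Sums of squares for the four random sources
    ss_r = t * m * np.sum((r - grand) ** 2)
    ss_rt = m * np.sum((rt - r[:, None] - tr[None, :] + grand) ** 2)
    ss_rm = t * np.sum((rm - r[:, None] - me[None, :] + grand) ** 2)
    resid = (x - rt[:, :, None] - rm[:, None, :] - tm[None, :, :]
             + r[:, None, None] + tr[None, :, None] + me[None, None, :]
             - grand)
    ss_e = np.sum(resid ** 2)    # three-way interaction used as Error

    ms_r = ss_r / (n - 1)
    ms_rt = ss_rt / ((n - 1) * (t - 1))
    ms_rm = ss_rm / ((n - 1) * (m - 1))
    ms_e = ss_e / ((n - 1) * (t - 1) * (m - 1))

    # Solve the expected mean squares for each component
    comps = {
        "ratees": (ms_r - ms_e) / (t * m),       # convergent validity
        "ratees_x_traits": (ms_rt - ms_e) / m,   # discriminant validity
        "ratees_x_methods": (ms_rm - ms_e) / t,  # method bias
        "error": ms_e,
    }
    # Assumption: negative estimates are set to zero
    comps = {k: max(v, 0.0) for k, v in comps.items()}
    total = sum(comps.values())
    icc = {k: (v / total if total > 0 else 0.0) for k, v in comps.items()}
    return comps, icc

# Hypothetical usage with simulated ratings (10 ratees, 3 traits, 2 methods)
rng = np.random.default_rng(0)
ratee_effect = rng.normal(size=(10, 1, 1))
ratings = 5 + ratee_effect + 0.5 * rng.normal(size=(10, 3, 2))
components, intraclass = mtmm_variance_components(ratings)
```

Because each intraclass correlation divides a component by the sum of
all four components, the coefficients sum to one within a study, which
is what makes them comparable across studies with different error
variances.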

Computations 

The computations associated with a multitrait-multimethod
design may be accomplished in several ways. First, the
computations may be conducted directly on the ratings that are
obtained in the investigation. These computations use the sum of
squares formulas that are traditionally employed in the analysis
of variance (e.g., Kirk, 1968, pp. 239-240).

Another computational strategy involves the use of the
variance-covariance matrix among the measures (Stanley, 1961).
This matrix can be used to compute the sum of squares for the
various random effects of interest in the multitrait-multimethod
investigation. This computational strategy has the advantage of
displaying the contributions of each of the measures to the
ordering of the ratees. It is directly related to the use of the
correlational matrix among the measures in a
multitrait-multimethod investigation (Campbell & Fiske, 1959;
Kavanagh, MacKinney, & Wolins, 1971).
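
The link between the variance-covariance matrix and the sums of
squares can be illustrated with a short sketch. It is not Stanley's
(1961) full procedure; it only shows, for a ratees-by-measures score
matrix in which each column is one trait-method unit, that the Ratees
sum of squares equals (n - 1) times the sum of all elements of the
covariance matrix divided by the number of measures. The function
names are hypothetical.

```python
import numpy as np

def ratee_ss_from_covariance(scores):
    """Ratee sums of squares computed from the covariance matrix.

    scores: (ratees, measures) array; each column is a trait-method unit.
    Returns (SS for Ratees, SS for Ratees x Measures).
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    cov = np.atleast_2d(np.cov(scores, rowvar=False))  # k x k, n-1 divisor
    ss_ratees = (n - 1) * cov.sum() / k
    # Variances on the diagonal carry the within-measure ratee variation;
    # what is left after removing SS Ratees is the interaction.
    ss_ratees_x_measures = (n - 1) * np.trace(cov) - ss_ratees
    return ss_ratees, ss_ratees_x_measures

def ratee_ss_direct(scores):
    """The same Ratees sum of squares computed directly from ratee means."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    ratee_means = scores.mean(axis=1)
    return k * np.sum((ratee_means - scores.mean()) ** 2)

# Hypothetical data: 10 ratees rated on 6 trait-method units
rng = np.random.default_rng(0)
scores = rng.normal(size=(10, 6))
from_cov, _ = ratee_ss_from_covariance(scores)
direct = ratee_ss_direct(scores)
```

The covariance form shows which pairs of measures contribute to the
ordering of ratees: large positive covariances between measures raise
the Ratees sum of squares, while measures that order ratees
differently lower it.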

Example

An issue of research in performance measurement is the
choice of a method for obtaining performance ratings (Schwab,
Heneman, & DeCotiis, 1975). Not all methods are equally
desirable. Methods should be compared in terms of the individual
differences that each accounts for in the ratings. One method is
preferred to others if the ratings that are obtained with that
method display more discriminant validity and less bias in
ordering the ratees.

As an illustration, suppose an investigator needs rating
scales for research on the performance of test administrators.
The investigator has defined three traits and collected the items
for constructing the rating scales (e.g., Dickinson & Zellinger,
1980). However, the investigator still needs to specify a method
for obtaining the ratings. The decision regarding a method has
been narrowed to a choice between example-anchored scales
(Taylor, 1968) and checklist scales (Landy & Trumbo, 1980). To
aid in decision making, the investigator has collected data in a
multitrait-multimethod design. The data are displayed in Table
2. The analysis of the data that the investigator used in making
this choice is presented below.


The  data  were  collected  from  a  group  of  raters  who  viewed 
videotapes  of  10  test  administrators  who  were  played  by  actors 
according  to  10  scripts  of  performance.  The  tests  that  were 
given  by  the  administrators  were  the  same  in  each  videotape. 
However,  the  performance  of  the  administrators  on  the  traits 
varied  across  the  videotapes.  The  group  of  raters  viewed  each 
tape,  discussed  the  performance  of  that  administrator,  and  rated 
performance  on  each  of  three  traits  using  the  example-anchored 
and checklist methods for rating.

The investigator employed a traditional formulation in
conducting an analysis of the variation in the ratings. A
summary of the analysis is shown in Table 3. The variance
components and intraclass correlation coefficients indicate that
the measures can be used to order ratees with substantial
validity and with little bias due to the method for rating.
Convergent validity and discriminant validity account for
approximately two-thirds of the variation that determines the
ordering of ratee performance. The example-anchored and
checklist methods for rating have little influence on the
ordering of ratees. Both are equally desirable, and the results
give the investigator no basis for preferring one method over the
other. Additional research or practical considerations
must guide the choice between the two methods.


Beyond  the  Basic  Design 

The basic design for a multitrait-multimethod investigation
can be expanded to research the factors that distort the validity
of ratings. Several theoretical models are available to guide
such research (DeCotiis & Petit, 1978; Landy & Farr, 1980; Wherry
& Bartlett, 1982). The models describe factors ranging from the
ability and motivation of raters to organizational policies
concerning the use and purpose of the ratings.

To expand the example, suppose the investigator broadens
the decision from a choice between two methods to a choice
between two methods that can be used to collect ratings for two
purposes. The investigator has collected data from two groups
with the basic multitrait-multimethod design. One group was told
that the purpose for the ratings is to research the validity of
the tests for selecting new employees, while the second group was
told that the purpose is to motivate employees by rewarding or
punishing them for their past performance. Finally, the
investigator collected only five ratings from each group.
Suppose the videotapes were randomly assigned to the two groups
such that the research group viewed videotapes one through five,
while the motivation group viewed the remaining videotapes.

A four-factor design was used to analyze the ratings that
were collected by the investigator (cf. Winer, 1971, pp.
539-546). The design has factors of Purposes, Ratees nested within
Purposes, Traits, and Methods. The psychometric interpretations
for the sources of variation are summarized in Table 4.


The expanded multitrait-multimethod design includes both
fixed and random effects. As with the basic design, only the
random effects in the expanded design provide inferences about
individual differences among ratees. For example, the Ratees
within Purposes effect represents the ability of the measures to
order the ratees in a manner that includes both purposes for the
ratings. This effect is a pooling of the Ratee effects available
from the two purposes for ratings. This pooled effect includes
variation due to convergent validity and the interaction of
convergent validity with purpose for the ratings. Unfortunately,
the nesting of ratees prevents separating the variation due to
convergent validity from its interaction. The decision by the
investigator to design the research with ratees nested within
purpose groups produces this confounding. Similarly, the
variation due to discriminant validity and method bias cannot be
separated from their interaction with purpose for the ratings.

A summary of the analysis is shown in Table 5. The expanded
research design suggests that purpose for the ratings has little
influence on the multitrait-multimethod properties of the
ratings. Convergent and discriminant validity again account for
substantial differences in the ratings of performance. Little
method bias is present; both methods of rating are equally
desirable. Purpose for the ratings influences only the raters'
use of the scales to describe amounts of the attributes. In
particular, the ratees were rated higher on trait number two when
the purpose for the ratings was research than when the purpose
was for motivation. Since that trait reflects the "warmth"
versus "coldness" of the administrator, the investigator suspects
that the raters valued this attribute highly in test
administrators and believes that the raters may have been
emphasizing high standards of rapport with the examinees on the
part of the test administrators.


IV.  ACCURACY  OF  PERFORMANCE  RATINGS 


Accuracy  statistics  have  been  described  as  the  most 
appropriate  criteria  for  assessing  the  distortions  in  ratings 
(Borman,  1978).  Although  other  statistics  are  available,  most 
lack  a  meaningful  standard  for  defining  distortions  (Saal  et  al., 
1980).  In  contrast,  the  person  perception  design  for 
investigating  accuracy  requires  the  development  of  a  standard. 
The standard is usually a set of target ratings that specifies
the  performance  scores  of  ratees  on  several  attributes. 

Target ratings can be developed from the judgments of
experts or other decision-making groups or from objective
measures. For example, psychologists have rated the performance
of actors as displayed in videotapes. These expert ratings were
averaged to define the target ratings (Borman, 1979b).
Supervisory ratings of performance have also been used to define
target ratings in assessing the accuracy of self-ratings (Mabe &
West, 1982). Finally, life history information and
paper-and-pencil tests have been used as objective measures to
develop target ratings for assessing the accuracy of ratings of
interviewee performance (Cline & Richards, 1960).

Cronbach's Formulation


The overall accuracy of a rater is defined as the sum of
squared discrepancies between the rater's performance ratings and
the target ratings for the ratees. Cronbach (1955) argued
convincingly that overall accuracy should be broken down into
four statistics that are mathematically independent components of
overall accuracy.

Elevation is the component of accuracy due to the mean of
the performance ratings given by a rater for the group of ratees
and the set of attributes. The rater whose mean is close to that
of the target ratings will tend to rate the performance of the
ratees more accurately. Although Cronbach (1955) stated that
elevation describes the way a rater uses the rating scale, this
statistic is useful for describing the accuracy of the rater in
judging the overall performance of a group of ratees (Murphy et
al., 1982).

Differential elevation is the component of accuracy
associated with the mean ratings that a rater gives the ratees on
the set of attributes. In some rating contexts, these mean
ratings for the set could indicate the overall job performance of
ratees. This component of accuracy reflects a rater's ability to
order ratees in comparison to their overall differences as
specified by the means of their target ratings. Murphy et al.
(1982) suggest that this component of accuracy is important for
administrative decisions. For example, a supervisor is often
required to nominate subordinates for training programs or to
choose one for promotion.

Stereotype accuracy reflects the accuracy of a rater in
using the attributes to describe the group of ratees. The mean
ratings on the attributes given by the rater to the group are
compared to the mean ratings given to the group by the expert
source. This component of accuracy is important in making
administrative decisions. For example, a supervisor may need to
diagnose relative strengths and weaknesses of a group of
subordinates to choose training programs or other developmental
activities for them. These decisions require accurate summary
evaluations of subordinates on the attributes of performance.

Finally, the most important component of accuracy is
differential accuracy (Cronbach, 1955). The target ratings for
each ratee are compared to the performance ratings given by the
rater. Differential accuracy reflects the rater's ability to
rate the individual ratee accurately. In an organizational
setting, differential accuracy is important for research purposes
and for developing employees. Most research projects utilize the
performance ratings of individuals, necessitating that each ratee
be described with little distortion in the ratings. Employee
development requires accurate feedback about an individual's
performance, so that changes that are undertaken for improvement
are appropriate to the individual.
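
The four components can be illustrated with a short sketch. It assumes
a single rater whose ratings and the matching target ratings are both
arranged as ratees-by-attributes arrays, and it works with mean
squared discrepancies (the sum of squared discrepancies divided by the
number of ratings) so that the four squared components add up to the
overall value. The function name cronbach_components and the example
numbers are hypothetical.

```python
import numpy as np

def cronbach_components(ratings, targets):
    """Split overall accuracy into four additive squared components.

    ratings, targets: (ratees, attributes) arrays for one rater and
    for the target source. Smaller values mean greater accuracy.
    """
    d = np.asarray(ratings, dtype=float) - np.asarray(targets, dtype=float)
    grand = d.mean()                    # difference in overall means
    ratee = d.mean(axis=1) - grand      # ratee mean differences, centered
    attr = d.mean(axis=0) - grand       # attribute mean differences, centered
    resid = d - grand - ratee[:, None] - attr[None, :]
    return {
        "elevation": grand ** 2,
        "differential_elevation": np.mean(ratee ** 2),
        "stereotype_accuracy": np.mean(attr ** 2),
        "differential_accuracy": np.mean(resid ** 2),
        "overall": np.mean(d ** 2),
    }

# Hypothetical ratings of four ratees on three attributes
ratings = [[5, 4, 6], [3, 3, 4], [6, 5, 7], [2, 4, 3]]
targets = [[4, 4, 5], [3, 2, 4], [6, 6, 6], [3, 3, 3]]
parts = cronbach_components(ratings, targets)
check = (parts["elevation"] + parts["differential_elevation"]
         + parts["stereotype_accuracy"] + parts["differential_accuracy"])
# check equals parts["overall"] up to rounding error
```

The additivity holds because the grand mean, the centered ratee means,
the centered attribute means, and the residuals of the discrepancy
matrix are mutually orthogonal. A rater who is uniformly one scale
point too lenient would show error only in elevation.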

Computations

The computations for the accuracy statistics were presented
by Cronbach (1955) in his seminal article. These statistics are
oriented to the descriptions of the accuracy of each rater, and
so the underlying research design is not emphasized. Indeed,
subsequent research studies have utilized the accuracy statistics
as measures of the rater's "ability" to perceive others (e.g.,
Borman, 1979b; Cline & Richards, 1960; Crow & Hammond, 1957). As
a consequence, little attention has been given to the basic
design underlying person perception investigations and its
extension to other areas of research.


Basic  Design 


Analysis of variance can be used to partition the rating
variance obtained in person perception investigations. The basic
design includes the three factors of rating sources, ratees, and
traits. Table 6 displays the seven sources of variation in the
basic design and summarizes the psychometric interpretations of
these sources.


The sources for ratings are the rater and the experts who
provided the target ratings. The variation in ratings accounted
for by Rating Sources reflects elevation accuracy. The larger
this source of variation, the larger the difference between the
overall mean rating of the rater and that of the experts, and the
more inaccurate is the rater.

The Ratees effect indicates the ability of the rating
sources to describe differences between ratees across the
attributes. This effect can be due to the rater, the expert
source for the target ratings, or both. Since the investigator
will typically select the ratees to differ from one another on
the attributes, the Ratees effect should account for substantial
variation in the ratings. However, the more the rater agrees
with the target ratings, the greater the Ratees effect. The
rater who is accurate in ordering the ratees, compared to the
expert source, enhances the convergent validity of the ratings.

The fixed effect of Traits reflects the relative amounts of
the performance attributes possessed by the group of ratees. The
investigator designs this effect into the research with the
choice of the rating context and the selection of the ratees.
The rating context usually includes attributes that differ in
their social desirability and, consequently, some attributes will
have greater value to the rater than others. Furthermore, the
ratees who are selected by the investigator may be chosen to have
unequal amounts of the attributes. If the expert source for
ratings provides target ratings that confirm the investigator's
intentions, it follows that the Traits effect is likely to
account for variation in the ratings.

The Rating Sources x Ratees interaction reflects
differential elevation accuracy, and it indicates differential
ordering of the ratees by the rater, compared to the expert
source for ratings. This differential ordering is undesirable.
An accurate rater should order the ratees in a manner similar to
that ordering provided by the expert source. Since the expert
source serves as the standard for defining the differences
between ratees, the effect can also be considered a reflection of
differential convergent validity. A rater may describe more or
fewer differences between the ratees in assessing their
performance on the set of attributes. The larger the
interaction, the more inaccurate is the rater in ordering the
ratees.

The Rating Sources x Traits interaction indicates the
stereotype accuracy of the rater. An accurate rater should agree
with the expert source in the relative amounts of the attributes
reflected in the group of ratees. The larger this interaction,
the more inaccurate the rater.

The Ratees x Traits interaction reflects the extent of
individual differences on the attributes perceived by the rater
and the expert source. Since the researcher should select the
ratees for the investigation, the differential ordering of the
ratees on the attributes can be determined by the researcher. Of
course, this assumes that the target ratings are close to the
intended performance scores for the ratees (Borman, 1979a). For
example, the researcher can construct videotapes of actors who
play ratees. Then, the performance of ratees can be acted out in
a manner which represents scaled amounts of the attributes. If
the investigator selects ratees who differ in their ordering on
the traits, then discriminant validity will explain variation in
the ratings. Moreover, the more the rater's ratings match those
of the expert source, the stronger will be the interaction, and
the more accurate will be the rater.

The Rating Sources x Ratees x Traits interaction reflects
the differential accuracy of the rater. This is the ability of
the rater relative to the expert source to describe individual
differences among the ratees. This interaction is undesirable.
The rater who is accurate should agree with the expert source on
the differences among the ratees. If the rater disagrees with
the expert source, the rater will possess more or less
discriminant validity than the expert source. Since the target
ratings serve as a standard, this differential discriminant
validity is undesirable.

Computations

The sums of squares that are obtained from the analysis of
the variance in ratings are closely related to the accuracy
statistics developed by Cronbach (1955). The accuracy statistics
are contrasts between effects in the analysis of variance design.
Each accuracy statistic is a contrast of effects of the rater to
those of the expert source for ratings. Of course, an effect is
a linear combination of means, and such combinations are used to
compute sums of squares in the design.
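
The link between the contrasts and the sums of squares can be
sketched numerically. The Python sketch below is illustrative only.
It assumes a single rater and a single expert source who rate the
same ratees on the same traits, and it uses summed (not averaged)
squared discrepancies, so the scaling differs from Cronbach's
original statistics. The function names are hypothetical, and the
ratings are borrowed from Method 1 of Table 2, with administrators
1 to 5 arbitrarily treated as the target ratings and administrators
6 to 10 as the rater's ratings.

```python
import numpy as np

# Assumption for illustration: Table 2, Method 1; rows are ratees, columns are traits.
TARGET = [[4, 7, 2], [3, 5, 1], [7, 9, 6], [6, 6, 2], [5, 5, 1]]
RATER = [[8, 2, 5], [4, 1, 1], [6, 3, 4], [7, 5, 2], [7, 1, 1]]


def cronbach_components(rater, target):
    """Summed squared rater-versus-target discrepancies, one per accuracy component."""
    def parts(x):
        x = np.asarray(x, dtype=float)
        grand = x.mean()
        ratee = x.mean(axis=1, keepdims=True) - grand  # ratee main effects
        trait = x.mean(axis=0, keepdims=True) - grand  # trait main effects
        resid = x - grand - ratee - trait              # ratee x trait residuals
        return grand, ratee, trait, resid

    g_r, a_r, b_r, e_r = parts(rater)
    g_t, a_t, b_t, e_t = parts(target)
    return {
        "elevation": float((g_r - g_t) ** 2),
        "differential_elevation": float(((a_r - a_t) ** 2).sum()),
        "stereotype": float(((b_r - b_t) ** 2).sum()),
        "differential": float(((e_r - e_t) ** 2).sum()),
    }


def source_sums_of_squares(rater, target):
    """Sums of squares that involve Rating Sources in a sources x ratees x traits layout."""
    y = np.stack([np.asarray(rater, dtype=float), np.asarray(target, dtype=float)])
    grand = y.mean()
    s = y.mean(axis=(1, 2), keepdims=True) - grand
    r = y.mean(axis=(0, 2), keepdims=True) - grand
    t = y.mean(axis=(0, 1), keepdims=True) - grand
    sr = y.mean(axis=2, keepdims=True) - grand - s - r
    st = y.mean(axis=1, keepdims=True) - grand - s - t
    rt = y.mean(axis=0, keepdims=True) - grand - r - t
    srt = y - grand - s - r - t - sr - st - rt

    def ss(effect):
        # Sum the squared effect over every cell of the full layout.
        return float((np.broadcast_to(effect, y.shape) ** 2).sum())

    return {"S": ss(s), "S x R": ss(sr), "S x T": ss(st), "S x R x T": ss(srt)}


components = cronbach_components(RATER, TARGET)
sums_of_squares = source_sums_of_squares(RATER, TARGET)
```

With two rating sources, each sum of squares is a fixed multiple of
the matching component: the number of ratees times the number of
traits, divided by two, for elevation; the number of traits divided
by two for differential elevation; the number of ratees divided by
two for stereotype accuracy; and one half for differential accuracy.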

Combination  Design 

The person perception design for the investigation of
accuracy can be combined with the multitrait-multimethod design.
The combined design includes the four factors of rating sources,
ratees, traits, and methods. In essence, the person perception
design has been expanded to include more than one method for
obtaining performance ratings, while the multitrait-multimethod
design has been expanded to include more than one source for the
ratings. As shown in Table 7, the combined design includes the
sources and psychometric interpretations of each separate design,
as well as several other sources, with their psychometric
interpretations.
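
As a rough illustration, the sources of variance in the four-factor
design can be enumerated mechanically. The Python sketch below
assumes one rating per cell, so that the highest-order interaction
serves as the error term; this assumption is consistent with the
degrees of freedom reported in Table 8. The function name and the
dictionary of factor levels are hypothetical.

```python
from itertools import combinations
from math import prod


def design_sources(levels):
    """Return (label, df) for each main effect and interaction, then the error term.

    Assumes one observation per cell, so the highest-order interaction is the error.
    """
    names = list(levels)
    sources = []
    for order in range(1, len(names)):
        for combo in combinations(names, order):
            df = prod(levels[name] - 1 for name in combo)
            sources.append((" x ".join(combo), df))
    sources.append(("Error", prod(levels[name] - 1 for name in names)))
    return sources


# Two rating sources, five ratees, three traits, and two methods, as in Table 8.
levels = {"S": 2, "R": 5, "T": 3, "M": 2}
sources = design_sources(levels)
for label, df in sources:
    print(f"{label:<12}{df:>3}")
```

The labels come out in the same order as the sources listed in
Table 7, and the degrees of freedom sum to one less than the number
of ratings.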


The Rating Sources x Ratees x Methods interaction reflects
the differential ordering of the ratees provided by the rater
using the designated methods for rating compared to the ordering
provided by the expert source using the same methods. This
differential ordering is undesirable. Regardless of the method
for rating, an accurate rater should order the ratees in a manner
similar to that ordering provided by the expert source. Of
course, the expert source for ratings may order ratees
differently depending on the method for rating. Since the expert
source serves as the standard for defining the differences
between ratees, this result can be considered a method bias in
the target ratings. However, a logical property for a standard
is that it not contain method bias. The target ratings should
serve to evaluate the rater's ability to describe ratees,
regardless of the method for rating. Ideally, the investigator
can design the research such that the target ratings will be
relatively free of method bias.

The Rating Sources x Traits x Methods interaction indicates
the accuracy of the rater in using the attributes to describe the
group of ratees by the methods. If the investigator designs the
research such that the target ratings contain no method bias,
this interaction suggests that the rater uses the attributes to
describe the performance of the group differently with each
method for rating. This interaction is again undesirable. No
component of a rater's accuracy should depend on the method for
rating.


Finally, the Ratees x Traits x Methods interaction reflects
the influence of the method on the ordering of ratees on the
attributes summed over the rater and the expert source. This
interaction is also undesirable. The ordering of ratees on the
traits should not depend on the method for obtaining ratings. If
the investigator designs the research such that the expert source
orders the ratees on the traits in the same manner, regardless of
the method used, the interaction is determined by the rater's
inability to use the methods similarly. This differential use
of the methods reflects differential discriminant validity by the
rater, and it indicates inaccuracy by the rater. The rater
should order the ratees on the attributes in the same manner
regardless of the method that the rater uses for making ratings.

Example

Consider an extension of the issue of the choice of a method
for obtaining performance ratings. Although methods should be
compared in terms of the individual differences in ratings that
each method accounts for, another aspect is the accuracy with
which the rater can use the methods to describe individual
differences in the ratings. The multitrait-multimethod design
assumes that a method is to be preferred if it influences the
ordering of ratees less than other methods. The combination
design extends the assumption to consider accuracy. A method is
preferred if the rater can use it to obtain greater agreement
with the expert source for ratings.

To  expand  the  example,  suppose  that  the  investigator 
developed  the  scripts  and  videotapes  in  a  series  of  workshops 
with  a  group  of  experts.  The  experts  were  highly  familiar  with 
the  performance  of  test  administrators.  Scripts  were  modified 
and  actors  changed  their  performance  until  the  experts  were  in 
high  agreement  in  their  ordering  of  the  ratees  with  each  method 
for  rating.  In  sum,  the  investigator  designed  the  target  ratings 
to  contain  no  method  bias. 

The investigator employed the combination design to evaluate
the accuracy of several raters. The results of the analysis of
the ratings that were obtained from one rater are shown in Table
8. The data in Table 2 were used for this analysis.

Furthermore,  assume  that  the  investigator  only  collected  five 
ratings  from  each  rating  source.  Suppose  that  the  expert  source 
and  rater  each  viewed  and  rated  the  same  set  of  five  videotapes 
selected  randomly  from  the  set  of  10. 
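
The analysis can be sketched in Python as follows. This is a minimal
illustration, not the procedure used to produce Table 8. It assumes
one rating per cell and, more importantly, that test administrators
1 to 5 in Table 2 supply the ratings of one source and
administrators 6 to 10 supply the ratings of the other source for
the same five videotapes; the text does not state this pairing. The
function name is hypothetical. Under these assumptions the sketch
gives mean squares of 3.75 for Rating Sources and 23.40 for Rating
Sources x Traits, in line with Table 8.

```python
from itertools import combinations

import numpy as np

# Table 2 ratings: one row per test administrator, columns are Traits 1 to 3.
METHOD_1 = [[4, 7, 2], [3, 5, 1], [7, 9, 6], [6, 6, 2], [5, 5, 1],
            [8, 2, 5], [4, 1, 1], [6, 3, 4], [7, 5, 2], [7, 1, 1]]
METHOD_2 = [[3, 6, 3], [3, 5, 4], [6, 8, 6], [4, 5, 3], [4, 4, 4],
            [5, 5, 7], [3, 4, 5], [6, 2, 2], [8, 6, 4], [4, 2, 2]]


def mean_squares(y, names):
    """Return {source: (df, mean square)} for a balanced factorial, one rating per cell."""
    y = np.asarray(y, dtype=float)
    axes = range(y.ndim)
    table = {}
    for order in range(1, y.ndim + 1):
        for subset in combinations(axes, order):
            # Build the effect as an alternating sum of marginal means.
            effect = np.zeros_like(y)
            for size in range(order + 1):
                for inner in combinations(subset, size):
                    collapse = tuple(a for a in axes if a not in inner)
                    marginal = y.mean(axis=collapse, keepdims=True) if collapse else y
                    effect = effect + (-1) ** (order - size) * marginal
            df = int(np.prod([y.shape[a] - 1 for a in subset]))
            label = " x ".join(names[a] for a in subset)
            table[label] = (df, float((effect ** 2).sum()) / df)
    return table


raw = np.array([METHOD_1, METHOD_2], dtype=float)   # (method, administrator, trait)
# Assumed pairing: administrators 1-5 are one source, 6-10 the other.
y = raw.reshape(2, 2, 5, 3).transpose(1, 2, 3, 0)    # (source, ratee, trait, method)
table = mean_squares(y, ["S", "R", "T", "M"])
for label, (df, ms) in table.items():
    print(f"{label:<14}{df:>3}{ms:>8.2f}")
```

The last entry, S x R x T x M, is the highest-order interaction and
plays the role of the error term when there is one rating per cell.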


The results of the research indicate that the rater was
fairly accurate. Elevation and differential accuracy accounted
for little variation in the ratings; neither was statistically
significant. The mean of the performance ratings given by the
rater for the group of ratees on the set of attributes compared
favorably to the mean provided by the expert source.

Importantly, the rater agreed for the most part with the expert
source on the differences among the ratees. The Rating Sources x
Ratees x Traits interaction was negligible in magnitude, which
suggests discriminations by the rater comparable to those by the
expert source.

The results do suggest some inaccuracies by the rater.
Differential elevation accuracy and stereotype accuracy were both
statistically significant. For most ratees, the rater and expert
source agreed on individual differences across the set of
attributes, regardless of method. However, one of the test
administrators was given a much greater mean rating by the expert
source. This ratee was the only female actor to play a test
administrator, and the investigator suspects that sex may explain
the greater mean rating. Perhaps the rater was prejudiced
against female test administrators. The Rating Sources x Traits
interaction indicated that the rater did not perceive the traits
similarly to the expert source. In particular, trait number two
was rated as significantly less prevalent by the rater. This
trait reflects the "coldness" versus "warmth" of the test
administrator, and the investigator suspects that the rater was
insensitive to that attribute of test administration.

The investigator was quite pleased that the method for
rating had little influence on the ratings. There was no method
bias in ordering the ratees shown by the rater or the expert
source. The investigator was successful in eliminating method
bias in the target ratings, and utilizing the set of attributes,
the rater was able to order accurately the group of ratees. For
this rater, at least, the investigator is confident that either
method for rating performance can be used to obtain accurate
ratings of performance. Nonetheless, the investigator does
recognize that the ratings obtained with the example-anchored
method cannot be compared in absolute size to those obtained with
the checklist method. The Traits x Methods interaction suggests
considerable scale bias in measuring the attributes.


V.  DISCUSSION 


Several models of the rating process outline variables that
influence the accuracy of ratings (DeCotiis & Petit, 1978;
Kavanagh, Borman, Hedge, & Gould, 1986; Landy & Farr, 1980).
However, none emphasizes the influence of logical requirements
for performance measures on accuracy. The research studies that
support the models have evaluated accuracy statistics against
rater attributes such as personality and training experience.
These studies illustrate a myopic research strategy (Cronbach,
1955). They are not connected to meaningful theory about the
logical requirements for performance measures.

The combination design can provide a broader strategy for
accuracy research. It emphasizes the assessment of accuracy in
the framework of logical requirements for performance measures.
The investigator can determine conditions under which ratings are
obtained, including contextual factors, ratees, traits, methods
for rating, and sources for target ratings. These conditions
allow the investigator to design specified amounts of
multitrait-multimethod properties into the target ratings. Such
logical requirements provide a rich framework for interpreting
the accuracy of performance ratings.

Target ratings should be designed to possess the
multitrait-multimethod properties found in practice. For example,
criterion research consistently shows that job performance is a
multidimensional concept (Landy & Trumbo, 1980). There are many
routes to success in most work contexts and, so, several
attributes are necessary to describe performance. Consequently,
the investigator must design the target ratings to possess
discriminant validity.

Several points are important to consider in the design. The
discriminant validity of the target ratings should be
representative of the rating context so that accuracy findings
generalize beyond the particular research setting. Brunswik's
(1956) view of representative design underscores this point.
Unfortunately, representative designs are apt to be expensive.
Most accuracy studies have used videotapes of four to eight
ratees who are each rated on six to 12 dimensions (e.g., Borman,
1977, 1979a; Hedge, 1982; McIntyre et al., 1984; Murphy et al.,
1982). Such small combinations of ratees and dimensions restrict
the amount of discriminant validity that can be designed into the
target ratings and, therefore, restrict the generality of the
research findings.

The combination design can be expanded to consider the broad
scope of research on performance ratings. Multiple raters can be
included in the Rating Sources so as to include rater
characteristics such as sex, race, ability, and motivation.
Effects coding of the raters against the expert source will
provide the statistics for each rater that are contained in the
combination design (Kerlinger & Pedhazur, 1973). Ratee
characteristics can also be studied in the combination design.
Videotapes of actors can be constructed whose target ratings are
identical but who differ in characteristics such as age, sex, and
race. Furthermore, manipulations of ratee and rater
characteristics in the same design address important legal
questions about equal employment opportunity and the quality of
performance ratings (Cascio & Bernardin, 1981). Finally,
contextual factors can be evaluated for their impact on the
accuracy of ratings. Factors that can be studied include the
intended use of the ratings (McIntyre et al., 1984), the content
of the attributes (Kavanagh, 1971), and the feedback given to
raters on their accuracy (Ilgen, Fisher, & Taylor, 1979).
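
One reading of the effects coding mentioned above can be sketched in
Python. The sketch assumes the conventional effect-coding scheme in
which each rater receives its own coded column and the expert source
is coded -1 in every column. The function name and the source labels
are hypothetical, and other coding schemes could serve the same
purpose.

```python
def effect_code(sources, reference):
    """Effect-code rating-source labels; the reference source is -1 in every column."""
    # One column per non-reference source, in order of first appearance.
    columns = [s for s in dict.fromkeys(sources) if s != reference]
    rows = []
    for source in sources:
        if source == reference:
            rows.append([-1] * len(columns))
        else:
            rows.append([1 if source == column else 0 for column in columns])
    return columns, rows


# Hypothetical labels: one expert source and two raters.
columns, rows = effect_code(["expert", "rater_a", "rater_b", "rater_a"], "expert")
for row in rows:
    print(row)
```

Each coded column can then be entered in a regression analysis of
the ratings, together with its products with the coded ratee, trait,
and method factors.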

VI. RECOMMENDATIONS

No research study has implemented the combination design to
investigate both the validity and accuracy of performance
ratings. The design provides a rich framework for understanding
the distortions in performance ratings and can identify factors
to control or remove to improve ratings. To date, most research
on the accuracy of ratings has focused on training raters to
become more accurate, and several training programs have been
used for this purpose. This line of research should continue;
however, it must be expanded to address the influence of logical
requirements for performance measures on accuracy training. For
example, a study should be undertaken to consider the impact of
accuracy training on performance measures that differ in their
amounts of discriminant validity. The combination design
developed in this paper provides an encompassing research
strategy for future studies to evaluate the validity and accuracy
of performance ratings.

Table 1. Summary Table for the Psychometric Interpretations of the
Basic Multitrait-Multimethod Design

Source          Psychometric Interpretation

Traits (T)      Trait Bias
Methods (M)     Scale Bias
T x M           Trait by Scale Bias
Ratees (R)      Convergent Validity
R x T           Discriminant Validity
R x M           Method Bias
Error           Sampling and Measurement Errors

Table 2. Example Data for Basic Multitrait-Multimethod Design

                        Method 1            Method 2
Test                     Traits              Traits
administrator         1     2     3       1     2     3

 1                    4     7     2       3     6     3
 2                    3     5     1       3     5     4
 3                    7     9     6       6     8     6
 4                    6     6     2       4     5     3
 5                    5     5     1       4     4     4
 6                    8     2     5       5     5     7
 7                    4     1     1       3     4     5
 8                    6     3     4       6     2     2
 9                    7     5     2       8     6     4
10                    7     1     1       4     2     2

Note. Trait 1 is maintaining procedures; Trait 2 is gaining
rapport; and Trait 3 is presenting instructions. Method 1 is
example-anchored, and Method 2 is checklist.


Table 3. Summary Table for the Analysis of the Data for the Basic
Multitrait-Multimethod Design

Source          df       MS    F-Ratio       VC     ICC

Traits (T)       2    18.87     4.43*       .49     .10
Methods (M)      1      .82      .55       -.01     .00
T x M            2     8.47     7.84**      .24     .05
Ratees (R)       9     9.57     8.86**     1.42     .29
R x T           18     4.26     3.94**     1.59     .32
R x M            9     1.48     1.37        .13     .03
Error           18     1.08                1.08

Note. If a source's variance component was negative, that value
was used in the denominator to compute intraclass correlation
coefficients, but the source's coefficient was set to zero. VC,
variance component; ICC, intraclass correlation coefficient.

*p < .05. **p < .01.
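
The rule in the note to Table 3 can be sketched in Python. The
sketch assumes that the denominator is the simple sum of all the
variance components in the table, including the error component and
any negative values; the note does not spell this out, but the
assumption reproduces the ICC column of Table 3 to two decimals.
The function name is hypothetical.

```python
def intraclass_correlations(components):
    """ICC for each source: its variance component over the sum of all components.

    Negative components stay in the denominator, but their own ICC is set to zero.
    """
    total = sum(components.values())
    return {name: max(vc, 0.0) / total for name, vc in components.items()}


# Variance components (VC column) from Table 3.
table3_vc = {
    "Traits (T)": 0.49,
    "Methods (M)": -0.01,
    "T x M": 0.24,
    "Ratees (R)": 1.42,
    "R x T": 1.59,
    "R x M": 0.13,
    "Error": 1.08,
}
icc = intraclass_correlations(table3_vc)
for name, value in icc.items():
    print(f"{name:<12}{value:>6.2f}")
```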

Table 4. Summary Table for Psychometric Interpretations of the
One-Factor Design Beyond the Multitrait-Multimethod Design

Source          Psychometric interpretation

Purposes (P)    Research Conditions
Ratees (R)/P    Convergent Validity Within Research Conditions
Traits (T)      Trait Bias
T x P           Trait Bias by Research Conditions
T x R/P         Discriminant Validity Within Research Conditions
Methods (M)     Scale Bias
M x P           Scale Bias by Purpose
M x R/P         Method Bias Within Research Conditions
T x M           Trait by Scale Bias
T x M x P       Trait by Scale Bias by Research Conditions
Error           Measurement and Sampling Errors


Table 5. Summary Table for Analysis of Data for One-Factor Design
Beyond the Multitrait-Multimethod Design

Source          df       MS    F-Ratio       VC     ICC

Purposes (P)     1     3.75      .36       -.11
Ratees (R)/P     8    10.30    11.32*      1.56     .34
Traits (T)       2    18.87    10.14*       .57     .12
T x P            2    23.40    12.58*       .71     .15
T x R/P         16     1.86     2.04        .48     .10
Methods (M)      1      .82      .55       -.01
M x P            1     1.35      .90
M x R/P          8     1.50     1.65        .20     .04
T x M            2     8.47     9.31*       .25     .05
T x M x P        2     2.40     2.64        .05     .01
Error           16      .91                 .91

Note. If a source's variance component was negative, that value
was used in the denominator to compute intraclass correlation
coefficients, but the source's coefficient was set to zero. VC,
variance component; ICC, intraclass correlation coefficient.


Table 6. Summary Table for the Psychometric Interpretations of the
Basic Accuracy Design

Source                Psychometric interpretation

Rating Sources (S)    Elevation Accuracy
Ratees (R)            Convergent Validity
Traits (T)            Trait Bias
S x R                 Differential Elevation Accuracy
                      (Differential Convergent Validity
                      by Rating Sources)
S x T                 Stereotype Accuracy
R x T                 Discriminant Validity
S x R x T             Differential Accuracy
                      (Differential Discriminant Validity
                      by Rating Sources)


Table 7. Summary Table of Psychometric Interpretations of the
Combination Design

Source                Psychometric interpretation

Rating Sources (S)    Elevation Accuracy
Ratees (R)            Convergent Validity
Traits (T)            Trait Bias
Methods (M)           Scale Bias
S x R                 Differential Elevation Accuracy
                      (Differential Convergent Validity by
                      Rating Sources)
S x T                 Stereotype Accuracy
S x M                 Differential Scale Bias by Rating
                      Sources
R x T                 Discriminant Validity
R x M                 Method Bias
T x M                 Trait by Scale Bias
S x R x T             Differential Accuracy (Differential
                      Discriminant Validity by Rating Sources)
S x R x M             Differential Elevation Accuracy by
                      Methods (Differential Method Bias by
                      Rating Sources)
S x T x M             Differential Stereotype Accuracy by
                      Methods
R x T x M             Differential Discriminant Validity by
                      Methods
Error                 Measurement and Sampling Errors


Table 8. Summary Table for the Analysis of the Data for the
Combination Design

Source                df       MS    F-Ratio       VC     ICC

Rating Sources (S)     1     3.75      .40       -.18     .00
Ratees (R)             4    11.31     1.22        .17     .03
Traits (T)             2    18.87      .80       -.18     .00
Methods (M)            1      .82      .49       -.03     .00
S x R                  4     9.29    12.39*      1.42     .25
S x T                  2    23.40    15.60*      2.19     .39
S x M                  1     1.35     1.52        .03     .01
R x T                  8     2.22     1.48        .18     .03
R x M                  4     2.11     2.37        .20     .04
T x M                  2     8.47     2.66        .19     .03
S x R x T              8     1.50     2.00        .38     .07
S x R x M              4      .89     1.19        .05     .01
S x T x M              2     2.40     3.20        .33     .06
R x T x M              8     1.07     1.43        .16     .03
Error                  8      .75                 .75

Note. If a source's variance component was negative, that value
was used in the denominator to compute intraclass correlation
coefficients, but the source's coefficient was set to zero. VC,
variance component; ICC, intraclass correlation coefficient.

*p < .01.